In [6]:
library(tidyverse)
library(repr)
library(ggplot2)
library(patchwork)
library(caret)
library(glmnet)
library(rsample)
library(car)
options(repr.plot.width = 16, repr.plot.height = 10)
Loading required package: carData


Attaching package: ‘car’


The following object is masked from ‘package:dplyr’:

    recode


The following object is masked from ‘package:purrr’:

    some


STAT 301 PROJECT PART 1¶

DATA DESCRIPTION¶

The data set used in this project is called Airbnb Prices in European Cities. This data provides Airbnb listings from popular European cities which are evaluated across 22 variables. These variables include:

  • ID: Observation ID [Numerical - Discrete]
  • realSum: The full price of the listing in EUR [Numerical - Continuous]
  • room_type: Categorizes the room type of the Airbnb as either Private Room, Shared Room, or Entire Home/Apt [Categorical]
  • room_shared: Indicates whether the room is shared (TRUE/FALSE) [Boolean/Binary]
  • room_private: Indicates whether the room is private (TRUE/FALSE) [Boolean/Binary]
  • person_capacity: Maximum number of guests allowed [Numerical - Discrete]
  • host_is_superhost: Indicates whether the host has superhost status (TRUE/FALSE) [Boolean/Binary]
  • multi: Differentiates whether the listing is for multiple rooms (0 = No, 1 = Yes) [Boolean/Binary]
  • biz: Indicates if the listing is for business purposes (0 = No, 1 = Yes) [Boolean/Binary]
  • cleanliness_rating: The cleanliness rating of the Airbnb listing [Numerical - Continuous]
  • guest_satisfaction_overall: The overall rating of the listing [Numerical - Continuous]
  • bedrooms: Number of bedrooms in the listing (0 for studios) [Numerical - Discrete]
  • dist: Distance from the city center in km [Numerical - Continuous]
  • metro_dist: The distance of the listing from the nearest metro station in km [Numerical - Continuous]
  • attr_index: The attraction index of the listing location [Numerical - Continuous]
  • attr_index_norm: The normalized attraction index (0-100) [Numerical - Continuous]
  • rest_index: The restaurant index of the listing location [Numerical - Continuous]
  • rest_index_norm: The normalized restaurant index (0-100) [Numerical - Continuous]
  • lng: Longitude of the listing location [Numerical - Continuous]
  • lat: Latitude of the listing location [Numerical - Continuous]
  • City: The city in which the listing is located (London, Rome, Budapest) [Categorical]
  • Time: The time of the week of the listing (Weekend, Weekday) [Categorical]

After combining 6 smaller sets of data from London, Budapest and Rome, using both weekend and weekday data we created a combined dataset with a total number of 23,042 observations.

PRE-SELECTION OF VARIABLES¶

For my analysis project I excluded the following variables: rest_index, attr_index, room_shared and room_private. I removed rest_index and attr_index as their normalized versions which are included in the dataset, provide a more meaningful and standardized measure, making the raw versions redundant. I also omitted room_shared and room_private since the variable room_type already categorizes the listing into private, shared, or entire home/apt, making these additional classifications unnecessary for my analysis. It is also not clear if the room_privite and room_shared variables were classifying a different aspect of the room and therefore I assumed it was covering the same information. I also chose to remove the categorical level of room_type = "Shared room" because the number of observations that fit into this category is significantly smaller than the other categories of room_type = "Private room" and room_type = "Entire home/apt". Therefore if I kept this data in the dataset I could end up in a situation where when I split the data into training and testing all the "Shared room" observations might fall into the testing set and my model would not recognize these points.

QUESTION¶

Part 1¶

Using this data, I want to examine the association between the variable of guest satisfaction (response) and predictors related to the price of the listing, which European city it belongs to (e.g., London, Rome, Budapest), its location (e.g., City, lng/lat coordinates, distance from the city centre) and the type of Airbnb listing (e.g., room type, if it’s for business, multiple rooms or not).

Part 2 (I changed the question)¶

Using this dataset, I aim to develop a model to figure out if I can predict guest satisfaction based on factors such as listing price, city (e.g., London, Rome, Budapest), location characteristics (e.g., latitude/longitude, distance from the city center), and Airbnb listing type (e.g., room type, business listing, multiple rooms or not).

Response Variable:¶

guest_satisfaction_overall

Primary Focus¶

Part 1 (corrected the word choice of influence to associate)¶

My question is focused on inference because I am interested in understanding the relationship between all my variables and how they relate to and (association) Airbnb guest satisfaction. I want to find generalizable conclusions about the factors that drive guest satisfaction.

Part 2 (I changed the focus of my project)¶

My question is focused on prediction because I am interested in understanding the relationship between all my variables and how they relate to and cause Airbnb guest satisfaction levels. I want to use my model to predict guest satisfaction based on multiple variables.

EXPLORATORY DATA ANALYSIS AND VISUALIZATION¶

WRAGLING AND CLEANING DATA¶

In [2]:
#Loading in the 6 data sets 
#Corrected the importing of data to be reproducible 
budapest_weekdays <- read_csv("https://raw.githubusercontent.com/AngelycaP/STAT-301/refs/heads/main/budapest_weekdays.csv")
budapest_weekends <- read_csv("https://raw.githubusercontent.com/AngelycaP/STAT-301/refs/heads/main/budapest_weekends.csv")
london_weekdays <- read_csv("https://raw.githubusercontent.com/AngelycaP/STAT-301/refs/heads/main/london_weekdays.csv")
london_weekends <- read_csv("https://raw.githubusercontent.com/AngelycaP/STAT-301/refs/heads/main/london_weekends.csv")
rome_weekdays <- read_csv("https://raw.githubusercontent.com/AngelycaP/STAT-301/refs/heads/main/rome_weekdays.csv")
rome_weekends <- read_csv("https://raw.githubusercontent.com/AngelycaP/STAT-301/refs/heads/main/rome_weekends.csv")

#Adding two new columns to each of the datasets 
budapest_weekdays_data <- budapest_weekdays %>% mutate(City = "Budapest", Time = "Weekday")
budapest_weekends_data <- budapest_weekends %>% mutate(City = "Budapest", Time = "Weekend") 
london_weekdays_data <- london_weekdays %>% mutate(City = "London", Time = "Weekday")
london_weekends_data <- london_weekends %>% mutate(City = "London", Time = "Weekend") 
rome_weekdays_data <- rome_weekdays %>% mutate(City = "Rome", Time = "Weekday")
rome_weekends_data <- rome_weekends %>% mutate(City = "Rome", Time = "Weekend") 

#Combining all the data sets into a tidy format
airbnb_combined <- bind_rows(budapest_weekdays_data, budapest_weekends_data, 
                             london_weekdays_data, london_weekends_data, 
                             rome_weekdays_data, rome_weekends_data)

airbnb_combined <- airbnb_combined |>
                rename(ID = ...1) |>
                filter(room_type != "Shared room")|>
                select(-rest_index, -attr_index, -room_shared, -room_private)

head(airbnb_combined)

#Inspecting the data
airbnb_combined %>% count

any(is.na(airbnb_combined))
New names:
• `` -> `...1`
Rows: 2074 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 1948 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 4614 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 5379 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 4492 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 4535 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 × 18
IDrealSumroom_typeperson_capacityhost_is_superhostmultibizcleanliness_ratingguest_satisfaction_overallbedroomsdistmetro_distattr_index_normrest_index_normlnglatCityTime
<dbl><dbl><chr><dbl><lgl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr>
0238.9905Entire home/apt6 TRUE01109910.35935500.3526430 24.11655267.6568519.0507447.50076BudapestWeekday
1300.7943Entire home/apt6FALSE01 99820.92942720.2002355100.00000034.2677019.0449347.50405BudapestWeekday
2162.3819Entire home/apt4 TRUE00109812.45084030.2794518 9.75555114.5382519.0217047.49882BudapestWeekday
3118.4377Entire home/apt2FALSE00 99211.55944940.4779711 11.43315524.7020519.0630147.51126BudapestWeekday
4134.4174Entire home/apt4 TRUE10109921.11380300.2701016 11.84365848.1232219.0690047.49900BudapestWeekday
5127.3676Entire home/apt4FALSE10 99120.26847030.1669317 37.90590376.1511819.0548047.50094BudapestWeekday
A tibble: 1 × 1
n
<int>
22966
FALSE

Visualization¶

In [3]:
scatter_plot <- ggplot(airbnb_combined, aes(x = log(realSum), y = guest_satisfaction_overall, color = City , shape= room_type)) + 
# I have logged realSum (the price of the Airbnb listing) because this variable has extreme values therefore 
# applying the log function will compress large values while maintaining the overall trends
  geom_jitter(alpha = 0.3, width = 0.3, height = 0.3) + # jitter is used in order to spread some of the points and reduce overlapping
#I set alpha to 0.3 because lowering the alpha (opacity) makes denser areas more visible
  labs(x = "Price (Logged, EUR)", 
       y = "Guest Satisfaction", 
       color = "City", 
       shape= "Room Types",
      title = "How Price, City, and Room Type Affect Guest Satisfaction") +
      theme_minimal() +
      theme(legend.position = "right", 
      plot.title = element_text(size = 14, face = "bold"))


box_plot <- ggplot(airbnb_combined, aes(x = City, y = guest_satisfaction_overall, fill = City)) +
  geom_boxplot(outlier.shape = 16, outlier.colour = "red", outlier.size = 3, alpha = 0.5) +  
  labs(x = "City", y = "Guest Satisfaction", title = "City and Guest Satisfaction") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "pink", "lightgreen"))

box_plot2 <- ggplot(airbnb_combined, aes(x = room_type, y = guest_satisfaction_overall, fill = room_type)) +
  geom_boxplot(outlier.shape = 16, outlier.colour = "red", outlier.size = 3, alpha = 0.5) +  
  labs(x = "Room Type", y = "Guest Satisfaction", title = "Room Type and Guest Satisfaction") +
  scale_fill_manual(values = c("orange", "purple")) + 
  theme_minimal()

box_plot + box_plot2 + scatter_plot
No description has been provided for this image

VISUALIZATION EXPLANATION¶

I have created a scatterplot to explore all 4 of my variables, and then two box plots to explore the relationship between my categorical variables and guest satisfaction. I believe these are relevant as the box plots show that the median guest satisfaction is different for each city but there is a lot of overlap and my other box plot shows that my two room types have very similar medians and overlap a lot, this tells me how much each variable plays a role in guest satisfaction. The scatterplot gives me a big picture of what is going on in my data. I had a lot of overlap so I can draw insight from that.

METHODS AND PLANS¶

Why is this method appropriate?¶

To develop a predictive model for guest satisfaction, a continuous variable, I will use predictors such as the price of the listing, city (e.g., London, Rome, Budapest), location (e.g., city center distance), and listing characteristics (e.g., room type, business suitability, number of rooms). Lasso regression will be used, as it is particularly well-suited for this task by penalizing large coefficients, thus improving model interpretability and generalizability. Before applying Lasso, I will check for multicollinearity using Variance Inflation Factor (VIF). VIF values higher than 5 or 10 suggest multicollinearity issues that might require addressing. After checking VIF, I will split the data into a training set (80%) and a testing set (20%). The training data will be standardized and used to fit the Lasso regression model via the glmnet function. I will select the optimal lambda (penalty term) through cross-validation. Finally, I will evaluate the model's performance using the testing set and assess its prediction accuracy, particularly by computing the Root Mean Squared Error (RMSE).

Which assumptions are required, if any, to apply the method selected?¶

To use an MLR model I need to assume the following:

  • Linearity: The relationship between the dependent variable (guest_satisfaction_overall) and each independent variable should be linear.
  • Independence: Observations should be independent of each other.
  • Homoscedasticity: The variance of residuals should remain constant across predictions.
  • Normality of Residuals: Residuals should follow a normal distribution
  • No extreme outliers

What are the potential limitations or weaknesses of the method selected?¶

Potential Limitations:

  • Airbnb guest satisfaction ratings tend to cluster around high values and values that are multiples of 10. Leading to a skewed distribution of the response variable. This may violate normality assumptions.

  • Multicollinearity: Some predictors (e.g., location-based features) might be highly correlated, requiring ridge or lasso regression to improve model stability. Multicollinearity occurs when there is a strong association between two or more covariates. These covariates bring similar information to the model, and we have trouble isolating their effect.

Lasso will be used to mitage the previous limitation however this method adds its on limitations to the mix.

  • Lasso can shrink coefficients of predictors too much, sometimes pushing important predictors to zero. This could result in an overly simplified model, losing useful information for prediction or interpretation.
  • Lasso introduces bias into the model coefficients, especially for highly correlated predictors. While this helps in reducing overfitting, it can distort the true relationship between predictors and the outcome.
  • Lasso assumes a linear relationship and does not automatically account for interaction terms between predictors. If there are important interactions, Lasso may not capture them unless explicitly included in the model.
In [14]:
# Set seed for reproducibility
set.seed(123)

# Split the data into training (80%) and testing (20%) sets
airbnb_split <- 
    airbnb_combined %>%
    initial_split(prop = 0.8, strata = guest_satisfaction_overall)

train_data <- training(airbnb_split)
test_data <- testing(airbnb_split)

#I will now conduct further pre selection of variables. I will remove ID to prevent non-informative columns from affecting the model. 
#I will also remove the columns lng and lat because we you haven't learned how to use spatial analysis variables.

train_data <- train_data %>% select(-ID, -lng, -lat)
test_data <- test_data %>% select(-ID, -lng, -lat)
In [15]:
## I will conduct an initial check for multicollinearity using VIF. Values higher than 5 or 10 suggests multicollinearity might be problematic.
mlr_model <- lm(guest_satisfaction_overall ~ ., data = train_data)  # Fit model with all predictors
vif(mlr_model)
## I have found low to moderate multicollinearity therefore I will use Lasso for further variable selection.
A matrix: 14 × 3 of type dbl
GVIFDfGVIF^(1/(2*Df))
realSum1.28473011.133459
room_type1.72359011.312856
person_capacity2.15965711.469577
host_is_superhost1.13405111.064918
multi1.43274111.196971
biz1.56087511.249350
cleanliness_rating1.12034911.058465
bedrooms1.59170611.261628
dist4.49450412.120025
metro_dist1.88056611.371337
attr_index_norm3.39120611.841523
rest_index_norm2.81801511.678694
City5.31454921.518332
Time1.00510811.002551

COMPUTATIONAL CODE AND OUTPUT¶

In [26]:
X_train <- model.matrix(guest_satisfaction_overall ~ ., data = train_data)[, -1]  # Remove intercept
y_train <- train_data$guest_satisfaction_overall
X_test <- model.matrix(guest_satisfaction_overall ~ ., data = test_data)[, -1]

# Standardize predictors
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test, center = attr(X_train_scaled, "scaled:center"), scale = attr(X_train_scaled, "scaled:scale"))

# Perform Lasso Regression with cross-validation
cv_lasso <- cv.glmnet(X_train_scaled, y_train, alpha = 1)  # Lasso (alpha = 1)

# Get the best lambda
best_lambda_lasso <- cv_lasso$lambda.min

# Train the final Lasso model
lasso_model <- glmnet(X_train_scaled, y_train, alpha = 1, lambda = best_lambda_lasso)

# Predict on the test set
lasso_preds <- predict(lasso_model, s = best_lambda_lasso, newx = X_test_scaled)

# Compute RMSE for Lasso
rmse_lasso <- sqrt(mean((lasso_preds - test_data$guest_satisfaction_overall)^2))

# Print RMSE
print(paste("RMSE (Lasso):", round(rmse_lasso, 3)))

# Create a data frame with actual and predicted values
prediction_data <- data.frame(Actual = test_data$guest_satisfaction_overall, 
                              s1 = lasso_preds)

#Scatter plot: Predicted vs Actual
plot <- ggplot(prediction_data, aes(x = Actual, y = s1)) +
  geom_jitter(alpha = 0.3, width = 0.3, height = 0.3) + # jitter is used in order to spread some of the points and reduce overlapping
  geom_point(color = "blue", alpha = 0.5) +  # Points for Predicted vs Actual values
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +  # Line of perfect prediction
  labs(title = "Predicted vs Actual Guest Satisfaction", 
       x = "Actual Guest Satisfaction", 
       y = "Predicted Guest Satisfaction") +
  theme_minimal() +
  theme(legend.position = "right", 
  plot.title = element_text(size = 14, face = "bold"))

plot
[1] "RMSE (Lasso): 6.128"
No description has been provided for this image

Visualization Interpretation¶

If this model was perfect all points would fall on the red line, instead we have a spread around the line, indicating prediction error. This model predicts high guest satisfaction fairly well, because most of our training data consited of higher ratings, which is also why this model struggles to predict low satisfaction scores often overestimating them. The predictions are also compressed toward the average, a common effect of Lasso regression shrinking extreme values. To improve, I would need to adress the imbalance in low-rated listings.

REFERENCE¶

Airbnb prices in European cities. (2024, March 10). Kaggle. https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities

Gyódi, K., & Nawaro, Ł. (2021). Determinants of Airbnb prices in European cities: A spatial econometrics approach (Supplementary Material) [Dataset]. In Zenodo (CERN European Organization for Nuclear Research). https://doi.org/10.5281/zenodo.4437019